-
Notifications
You must be signed in to change notification settings - Fork 726
JSON subcolumns store change from string to binary #27653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON subcolumns store change from string to binary #27653
Conversation
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
🟢 |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
⚪ Ya make output | Test bloat | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
ydb/core/tx/columnshard/engines/storage/indexes/portions/extractor/sub_column.cpp
Show resolved
Hide resolved
| const ui32 size = filterBits.Size(); | ||
| for (ui64 i = 0; i < HashesCount; ++i) { | ||
| const ui64 hash = NArrow::NHash::TXX64::CalcSimple(data, i); | ||
| const ui64 hash = NArrow::NHash::TXX64::CalcSimple(data.GetStringRobust(), i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
а GetStringRobust прямо даже вложенный json засериализует?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Да, он что угодно засериализует. Смысл в том, что если там строка, то он построит индекс по строке без кавычек, а если что-то другое, то он преобразует в нормальный json и построит индексы по нему. Должно работать как раньше, так как сюда приходило всё в виде строки
| AFL_VERIFY(CurrentIndex <= GlobalChunkedArray->GetRecordsCount())("index", CurrentIndex)("count", GlobalChunkedArray->GetRecordsCount()); | ||
| } | ||
|
|
||
| NJson::TJsonValue TColumnsData::TIterator::GetValue() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Я кстати не уверен что NJson::TJsonValue это самый эффективный способ хранения json, можно будет поисследовать другие либы как они по префу. Это уже отдельно можно провернуть
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a change to JSON data serialization in the columnar storage engine, transitioning from string-based to binary JSON representation. The key changes include converting JSON values to binary format during storage and properly handling JSON types during data extraction and filtering operations.
Key changes:
- JSON values are now stored as
arrow::BinaryTypeinstead ofarrow::StringType, using binary JSON serialization for efficient storage - Data extractors now return
NJson::TJsonValueinstead ofstd::string_viewto preserve type information - Added conversion logic to properly deserialize binary JSON to string representation when extracting column paths
Reviewed Changes
Copilot reviewed 43 out of 44 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| ydb/tests/olap/chunks/* | New test infrastructure for chunked data scenarios with JSON columns |
| ydb/core/tx/columnshard/engines/storage/indexes/portions/extractor/* | Updated extractors to use NJson::TJsonValue for record visitors |
| ydb/core/tx/columnshard/engines/storage/indexes/*/meta.cpp | Updated bloom filter implementations to work with JSON values |
| ydb/core/formats/arrow/accessor/sub_columns/* | Core changes to use binary storage for JSON data and proper deserialization |
| ydb/core/formats/arrow/program/kernel_logic.cpp | Added binary JSON to string conversion for path extraction |
| ydb/core/kqp/ut/olap/json_ut.cpp | Updated test expectations to reflect proper JSON type preservation |
| ydb/core/pyrightconfig.json | Configuration file with hardcoded paths |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ydb/core/tx/columnshard/engines/scheme/versions/abstract_scheme.cpp
Outdated
Show resolved
Hide resolved
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 34 out of 35 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ui32 prevIndex = 0; | ||
| for (ui32 idx = 0; idx < UI32ColIndex->length(); ++idx) { | ||
| auto currentIndex = UI32ColIndex->Value(idx); | ||
| for (ui32 i = prevIndex; i < currentIndex; ++i) { | ||
| visitor(DefaultsArray); | ||
| } | ||
| visitor(ColValue->Slice(idx, 1)); | ||
| prevIndex = currentIndex + 1; | ||
| } | ||
| for (; prevIndex < RecordsCount; ++prevIndex) { | ||
| visitor(DefaultsArray); | ||
| } |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VisitValues implementation creates array slices (ColValue->Slice(idx, 1)) in a loop which could be inefficient for large datasets. Consider batching or accessing the underlying array data directly to avoid repeated slice operations.
| } else if (container.GetType() == NBinaryJson::EContainerType::Array) { | ||
| iterators.emplace_back(std::make_unique<TArrayExtractor>(container.GetArrayIterator(), key)); | ||
| if (FirstLevelOnly || container.GetType() == NBinaryJson::EContainerType::Array) { | ||
| res = NBinaryJson::SerializeToBinaryJson(NBinaryJson::SerializeToJson(container), false); |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This performs a round-trip serialization: binary JSON -> JSON string -> binary JSON. This is inefficient. Consider storing the binary JSON container directly without the intermediate JSON string conversion.
| res = NBinaryJson::SerializeToBinaryJson(NBinaryJson::SerializeToJson(container), false); | |
| res = container; |
| auto future = NRpcService::DoLocalRpc<TEvBulkUpsertRequest>(std::move(request), "", "", runtime->GetActorSystem(0)); | ||
| future.Subscribe([&](const NThreading::TFuture<Ydb::Table::BulkUpsertResponse> f) mutable { | ||
| responses.fetch_add(1); | ||
| future.Subscribe([&](const NThreading::TFuture<Ydb::Table::BulkUpsertResponse> f) { |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lambda captures local variables by reference (&) but the future is asynchronous. If the function returns before the callback executes, these references will be dangling. The atomic 'responses' variable and other locals may be destroyed before the callback runs. Consider capturing by value or using shared_ptr for the required state.
| std::string_view view = heap.back().GetValuePointer() ? | ||
| std::string_view(heap.back().GetValuePointer()->Data(), heap.back().GetValuePointer()->Size()) : ""; | ||
| std::string_view* viewPtr = heap.back().GetValuePointer() ? &view : nullptr; |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code calls heap.back().GetValuePointer() multiple times (4 times total). Consider storing the result in a variable to improve readability and avoid redundant calls.
| std::string_view view = heap.back().GetValuePointer() ? | |
| std::string_view(heap.back().GetValuePointer()->Data(), heap.back().GetValuePointer()->Size()) : ""; | |
| std::string_view* viewPtr = heap.back().GetValuePointer() ? &view : nullptr; | |
| auto* valuePtr = heap.back().GetValuePointer(); | |
| std::string_view view = valuePtr ? std::string_view(valuePtr->Data(), valuePtr->Size()) : ""; | |
| std::string_view* viewPtr = valuePtr ? &view : nullptr; |
|
мне ок, но там еще copilot предложил разумные замечания, оставляю на твое усмотрение |
Changelog entry
...
Changelog category
Description for reviewers
...